Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks

Kilictas, Bugra, Alpay, Faruk

arXiv.org Artificial Intelligence

ISO 639:2023 unifies the ISO language-code family and introduces contextual metadata, but it lacks a machine-native mechanism for handling dialectal drift and creole mixtures. We propose a formalisation of recursive semantic anchoring, attaching to every language entity $\chi$ a family of fixed-point operators $\phi_{n,m}$ that model bounded semantic drift via the relation $\phi_{n,m}(\chi) = \chi \oplus \Delta(\chi)$, where $\Delta(\chi)$ is a drift vector in a latent semantic manifold. The base anchor $\phi_{0,0}$ recovers the canonical ISO 639:2023 identity, whereas $\phi_{99,9}$ marks the maximal drift state that triggers a deterministic fallback. Using category theory, we treat the operators $\phi_{n,m}$ as morphisms and drift vectors as arrows in a category $\mathrm{DriftLang}$. A functor $\Phi: \mathrm{DriftLang} \to \mathrm{AnchorLang}$ maps every drifted object to its unique anchor and proves convergence. We provide an RDF/Turtle schema (BaseLanguage, DriftedLanguage, ResolvedAnchor) and worked examples -- e.g., $\phi_{8,4}$ (Standard Mandarin) versus $\phi_{8,7}$ (a colloquial variant), and $\phi_{1,7}$ for Nigerian Pidgin anchored to English. Experiments with transformer models show higher accuracy in language identification and translation on noisy or code-switched input when the $\phi$-indices are used to guide fallback routing. The framework is compatible with ISO/TC 37 and provides an AI-tractable, drift-aware semantic layer for future standards.
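
A minimal Python sketch of the anchoring idea (the names, vector representation, and fallback rule here are illustrative assumptions, not the paper's implementation):

    import numpy as np

    MAX_DRIFT = (99, 9)  # assumed index of the maximal drift state

    def phi(chi: np.ndarray, delta: np.ndarray, n: int, m: int) -> np.ndarray:
        """Drift operator phi_{n,m}(chi) = chi (+) Delta(chi), with (+)
        modelled here as vector addition in a latent semantic space."""
        if (n, m) == (0, 0):
            return chi                  # base anchor: canonical ISO 639:2023 identity
        if (n, m) >= MAX_DRIFT:
            return resolve_anchor(chi)  # deterministic fallback at maximal drift
        return chi + delta

    def resolve_anchor(chi: np.ndarray) -> np.ndarray:
        # Stand-in for the functor Phi: DriftLang -> AnchorLang, which maps
        # every drifted object back to its unique anchor.
        return chi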


Learn and Don't Forget: Adding a New Language to ASR Foundation Models

Qian, Mengjie, Tang, Siyuan, Ma, Rao, Knill, Kate M., Gales, Mark J. F.

arXiv.org Artificial Intelligence

Foundation ASR models often support many languages, e.g. 100 languages in Whisper. However, there has been limited work on integrating an additional, typically low-resource, language while maintaining performance on the original language set. Fine-tuning, while simple, may degrade the accuracy of the original set. We compare three approaches that exploit adaptation parameters: soft language code tuning, which trains only the language code; soft prompt tuning, which trains prepended tokens; and LoRA, in which a small set of additional parameters is optimised. Elastic Weight Consolidation (EWC) offers an alternative compromise, with the potential to maintain performance in specific target languages. Results show that direct fine-tuning yields the best performance for the new language but degrades existing language capabilities. EWC can address this issue for specific languages. If only adaptation parameters are used, the existing language capabilities are maintained, but at the cost of performance in the new language.
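
A minimal PyTorch sketch of the EWC penalty mentioned above (variable names are illustrative; the paper's exact formulation may differ):

    import torch

    def ewc_penalty(model, fisher, old_params, lam=1.0):
        """Penalise movement of parameters that carried high Fisher
        information for the original languages."""
        loss = 0.0
        for name, param in model.named_parameters():
            if name in fisher:
                loss = loss + (fisher[name] * (param - old_params[name]) ** 2).sum()
        return 0.5 * lam * loss

    # total = asr_loss_on_new_language + ewc_penalty(model, fisher, old_params)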


Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Frohmann, Markus, Sterner, Igor, Vulić, Ivan, Minixhofer, Benjamin, Schedl, Markus

arXiv.org Artificial Intelligence

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved with rule-based or statistical methods that rely on lexical features such as punctuation. Although some recent works no longer rely exclusively on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment Any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that yield a threefold gain in speed over the previous state of the art and eliminate a spurious reliance on context far in the future. Finally, we introduce a variant of our model fine-tuned on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
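
The models are distributed through the authors' wtpsplit package; a minimal usage sketch (the checkpoint name is an assumption, see the linked page for available models):

    from wtpsplit import SaT

    # load a small SaT checkpoint from the Hugging Face hub
    sat = SaT("sat-3l-sm")

    # segment poorly formatted text without relying on punctuation
    sentences = sat.split("this is a test this is another test")
    print(sentences)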


ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Sälevä, Jonne, Lignos, Constantine

arXiv.org Artificial Intelligence

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
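
A short Python sketch of the gazetteer use case (the column names are assumptions about the release format; consult the ParaNames documentation for the actual schema):

    import csv
    from collections import defaultdict

    def load_gazetteer(path, language):
        """Map surface names to entity types (PER/LOC/ORG) for one language."""
        gazetteer = defaultdict(set)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if row["language"] == language:  # assumed column names
                    gazetteer[row["name"]].add(row["type"])
        return gazetteer

    # gaz = load_gazetteer("paranames.tsv", "sw")
    # gaz.get("Nairobi")  # -> {"LOC"}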


Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability

Chen, Wei-Rui, Adebara, Ife, Doan, Khai Duy, Liao, Qisheng, Abdul-Mageed, Muhammad

arXiv.org Artificial Intelligence

Recently, ChatGPT has emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. In this work, we investigate ChatGPT's language identification abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 23 language families. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource and are spoken on five continents. We then study the ability of ChatGPT (both GPT-3.5 and GPT-4) to (i) identify both language names and language codes, (ii) under both zero- and few-shot conditions, and (iii) with and without provision of a label set. When compared to smaller finetuned language identification tools, we find that ChatGPT lags behind. Our empirical analysis suggests that ChatGPT still needs substantial improvement before it can adequately serve diverse communities.
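
A sketch of the kind of zero-shot query involved, using the OpenAI Python client (the prompt wording is an illustrative assumption, not the paper's exact template):

    from openai import OpenAI

    client = OpenAI()

    def identify_language(text: str) -> str:
        """Ask for an ISO 639-3 code, zero-shot, with no label set provided."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "What is the ISO 639-3 code of the language of the "
                           f"following text? Reply with the code only.\n\n{text}",
            }],
        )
        return response.choices[0].message.content.strip()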


Lego-MT: Learning Detachable Models for Massively Multilingual Machine Translation

Yuan, Fei, Lu, Yinquan, Zhu, WenHao, Kong, Lingpeng, Li, Lei, Qiao, Yu, Xu, Jingjing

arXiv.org Artificial Intelligence

Multilingual neural machine translation (MNMT) aims to build a unified model for many language directions. Existing monolithic models for MNMT face two challenges: parameter interference among languages and inefficient inference for large models. In this paper, we revisit the classic multi-way structures and develop a detachable model by assigning each language (or group of languages) to an individual branch that supports plug-and-play training and inference. To learn representations for all languages in a unified space, we propose a novel, efficient training recipe, upon which we build an effective detachable model, Lego-MT. For a fair comparison, we collect data from OPUS and build a translation benchmark covering 433 languages and 1.3B parallel sentence pairs. Experiments show that Lego-MT, with 1.2B parameters, brings an average gain of 3.2 spBLEU. It even outperforms M2M-100 with 12B parameters. The proposed training recipe brings a $28.2\times$ speedup over the conventional multi-way training method (code: https://github.com/CONE-MT/Lego-MT).
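
A conceptual PyTorch sketch of the detachable-branch idea (module names and layers are stand-ins, not the actual Lego-MT architecture):

    import torch.nn as nn

    class DetachableMT(nn.Module):
        """Each language owns an encoder/decoder branch around a shared core;
        only the branches for the current direction need to be loaded."""
        def __init__(self, d_model=512):
            super().__init__()
            self.encoders = nn.ModuleDict()  # per-language encoder branches
            self.decoders = nn.ModuleDict()  # per-language decoder branches
            self.core = nn.Linear(d_model, d_model)  # stand-in for shared layers

        def add_language(self, lang, d_model=512):
            self.encoders[lang] = nn.Linear(d_model, d_model)
            self.decoders[lang] = nn.Linear(d_model, d_model)

        def forward(self, x, src_lang, tgt_lang):
            return self.decoders[tgt_lang](self.core(self.encoders[src_lang](x)))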


ParaNames: A Massively Multilingual Entity Name Corpus

Sälevä, Jonne, Lignos, Constantine

arXiv.org Artificial Intelligence

We introduce ParaNames, a multilingual parallel name resource consisting of 118 million names spanning 400 languages. Names are provided for 13.6 million entities, which are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released under a Creative Commons license (CC BY 4.0) at https://github.com/bltlab/paranames.
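
A short sketch of extracting English-to-target training pairs for such a model (column names are again assumptions about the release format):

    import csv
    from collections import defaultdict

    def name_pairs(path, target_lang):
        """Yield (English name, target-language name) pairs per Wikidata entity."""
        by_entity = defaultdict(dict)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                by_entity[row["wikidata_id"]][row["language"]] = row["name"]
        for names in by_entity.values():
            if "en" in names and target_lang in names:
                yield names["en"], names[target_lang]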


Translating with Google Sheets

#artificialintelligence

Google Translate is an amazing feat of engineering that uses artificial intelligence to translate speech and text from a chosen language into another. In most cases, Google Translate's own interface, embedded in Google Search or available at translate.google.com, is all you need. But again, as in other case studies presented here, Google Sheets comes to the rescue! Other than formatting the file to your liking, you can create drop-down lists for the source and target languages. This will help you be more productive, as you will not need to search for the language codes every time you want to change them. In my case, I used the Data Validation feature with a List from a Range as the criterion.
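
For instance, the built-in GOOGLETRANSLATE function can reference those drop-down cells directly (the cell addresses are illustrative, assuming the source and target codes sit in B1 and C1):

    =GOOGLETRANSLATE(A2, $B$1, $C$1)

Dragging the formula down the column then translates every source cell with the currently selected language pair.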


Build a Translation Application with AWS

#artificialintelligence

Amazon's suite of ML services is constantly expanding. From building custom ML pipelines in SageMaker to a versatile set of AutoML services, the options for deploying solutions and tackling ML problems are nearly limitless. Neural machine translation is a theoretically intense field that ordinarily requires deep knowledge of LSTMs and deep learning frameworks such as TensorFlow and PyTorch. In this article we will explore AWS Translate, a neural machine translation tool that supports 71 languages and lets you build applications with a simple API call. This article is a continuation of the Auto-ML on AWS series; check out the Rekognition and Comprehend articles for the first two parts.
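
A minimal sketch of that API call using boto3 (the region and strings are placeholders):

    import boto3

    # create an AWS Translate client
    translate = boto3.client("translate", region_name="us-east-1")

    result = translate.translate_text(
        Text="Hello, world!",
        SourceLanguageCode="en",
        TargetLanguageCode="es",
    )
    print(result["TranslatedText"])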